APPARATUS AND METHOD FOR CONTROL PROCESSING IN DUAL PATH PROCESSOR

TECHNICAL FIELD

This invention relates to a computer processor, a method of operating the same, and a
computer program product comprising an instruction set for the computer.



BACKGROUND 

In order to increase the speed of computer processors, prior art architectures have used
dual execution paths for executing instructions. Dual execution path processors can operate
according to a single instruction multiple data (SIMD) principle, using parallelism of operations
to increase processor speed.

However, despite use of dual execution paths and SIMD processing, there is an ongoing
need to increase processor speed. Typical dual execution path processors use two substantially
identical channels, so that each channel handles both control code and datapath code. While
known processors support a combination of 32-bit standard encoding and 16-bit "dense"
encoding, such schemes suffer from several disadvantages, including a lack of semantic content
in the few bits available in a 16-bit format.

Furthermore, conventional general purpose digital signal processors are not able to match
application specific algorithms for many purposes, including performing specialized operations
such as convolution, Fast Fourier Transforms, Trellis/Viterbi encoding, correlation, finite
impulse response filtering, and other operations.

SUMMARY

In one embodiment according to the invention, there is provided a computer processor. 
The computer processor comprises: (a) a decode unit for decoding a stream of instruction packets 
from a memory, each instruction packet comprising a plurality of instructions; (b) a first
processing channel comprising a plurality of functional units and operable to perform control
processing operations; and (c) a second processing channel comprising a plurality of functional
units and operable to perform data processing operations; wherein the decode unit is operable to
receive an instruction packet and to detect if the instruction packet defines (i) a plurality of
control instructions or (ii) a plurality of instructions one or more of which is a data processing
instruction, and wherein when the decode unit detects that the instruction packet comprises a
plurality of control instructions said control instructions are supplied to the first processing
channel for execution in program order.

In related embodiments, the decode unit of the computer processor may be operable to 
detect an instruction packet comprising three control instructions and control the control process 
to execute each of the three control instructions in the order in which they appear in the
instruction packet. The decode unit may also be operable to detect an instruction packet
containing a plurality of control instructions of equal length; or to detect, within an instruction
packet, a control instruction of a bit length between 18 and 24 bits; and in particular, to detect a
plurality of control instructions each having a bit length of 21 bits. The decode unit may be
operable to receive and decode instruction packets of a bit length of 64 bits.

In further related embodiments, the decode unit may be operable to detect when there is at
least one data processing instruction in the instruction packet and, in response thereto, to cause
relevant data to be supplied to the data processing channel. The decode unit may also be
operable to detect that the instruction packet comprises at least one data processing instruction
and a further instruction selected from one or more of: a memory access instruction; a control
instruction; and a data processing instruction. The at least one data processing instruction and
said further instruction may be executed simultaneously. The second processing channel may be
dedicated to the performance of data processing operations, and data processing instructions may
be provided in assembly language. The control processing operations may be performed on
operands up to a first pre-determined bit width and the data processing operations may be
performed on data up to a second pre-determined bit width, the second pre-determined bit width
being larger than the first pre-determined bit width.

In further related embodiments, the first processing channel may comprise units selected
from one or more of: a control register file; a control execution unit; a branch execution unit and
a load/store unit. The second processing channel may comprise a data execution path including a
configurable data execution unit. The second processing channel may also comprise a data
execution path including a fixed data execution unit. In use, one or more of the configurable and
fixed data execution units may operate according to single instruction multiple data principles.
The data processing channel may comprise one or more of a data register file and a load/store
unit. A single load/store unit may be accessed by both the control processing channel and the
data processing channel through respective ports.

In further related embodiments, the decode unit may be operable to detect an instruction
packet comprising at least one data processing instruction, wherein the bit length of the at least
one data processing instruction is between 30 and 38 bits; for example, the bit length may be 34
bits. The decode unit may also be operable to detect an instruction packet comprising a data
processing operation and a memory access instruction. The bit length of said memory access
instruction may be, for example, 28 bits. The decode unit may also be operable to detect an
instruction packet comprising a data processing instruction and a control processing instruction.
The control processing instruction may be in C code or a variant thereof. The decode unit may
also be operable to detect a data processing instruction in assembly language.

In another embodiment according to the invention, there is provided a method of
operating a computer processor which comprises first and second processing channels, each
having a plurality of functional units, wherein the first processing channel is capable of
performing control processing operations and the second processing channel is capable of
performing data processing operations. The method comprises: (a) receiving a sequence of
instruction packets from a memory, each of said instruction packets comprising a plurality of
instructions defining operations; (b) decoding each instruction packet in turn by determining if
the instruction packet defines: (i) a plurality of control instructions; or (ii) at least one data
processing instruction; and wherein when the decode unit detects that the instruction packet
comprises a plurality of control instructions, supplying said plurality of control instructions to
said first processing channel for execution in the sequence.

In another embodiment according to the invention, there is provided a computer program 
product comprising program code means for causing a computer to be operated according to the 
preceding method. 

In a further embodiment according to the invention, there is provided a computer program 
code, comprising a sequence of instructions for causing a computer to be operated according to
the preceding method.

In another embodiment according to the invention, there is disclosed an instruction set for
a computer including a first class of instruction packets each comprising a plurality of control
instructions for execution sequentially and a second class of instruction packets each comprising
at least a data processing instruction and a further instruction for execution contemporaneously,
said further instruction being selected from one or more of: a memory access instruction; a
control instruction; and a data processing instruction.

Additional advantages and novel features of the invention will be set forth in part in the 
description which follows, and in part will become apparent to those skilled in the art upon
examination of the following and the accompanying drawings; or may be learned by practice of
the invention. 

BRIEF DESCRIPTION OF THE DRAWINGS
For a better understanding of the present invention, and to show how the same may be
carried into effect, reference will now be made, by way of example only, to the accompanying
drawings, in which: 

Fig. 1 is a block diagram of an asymmetric dual execution path computer processor, 
according to an embodiment of the invention;

Fig. 2 shows exemplary classes of instructions for the processor of Fig. 1, according to an 
embodiment of the invention; and

Fig. 3 is a schematic showing components of a configurable deep execution unit, in
accordance with an embodiment of the invention.

DETAILED DESCRIPTION 
Fig. 1 is a block diagram of an asymmetric dual path computer processor, according to an 
embodiment of the invention. The processor of Fig. 1 divides processing of a single instruction 
stream 100 between two different hardware execution paths: a control execution path 102, which
is dedicated to processing control code, and a data execution path 103, which is dedicated to
processing data code. The data widths, operators, and other characteristics of the two execution
paths 102, 103 differ according to the different characteristics of control code and datapath code.
Typically, control code favors fewer, narrower registers, is difficult to parallelize, is typically (but
not exclusively) written in C code or another high-level language, and its code density is
generally more important than its speed performance. By contrast, datapath code typically favors
a large file of wide registers, is highly parallelizable, is written in assembly language, and its
performance is more important than its code density. In the processor of Fig. 1, the two different
execution paths 102 and 103 are dedicated to handling the two different types of code, with each
side having its own architectural register file, such as control register file 104 and data register
file 105, differentiated by width and number of registers; the control registers are of narrower
width, by number of bits (in one example, 32 bits), and the data registers are of wider width (in
one example, 64 bits). The processor is therefore asymmetric, in that its two execution paths have
different bit widths owing to the fact that they each perform different, specialised functions.

In the processor of Fig. 1, the instruction stream 100 is made up of a series of instruction
packets. Each instruction packet supplied is decoded by an instruction decode unit 101, which
separates control instructions from data instructions, as described further below. The control
execution path 102 handles control-flow operations for the instruction stream, and manages the
machine's state registers, using a branch unit 106, an execution unit 107, and a load/store unit
108, which in this embodiment is shared with the data execution path 103. Only the control side
of the processor need be visible to a C, C++, or Java language compiler, or another high-level
language compiler. Within the control side, the operation of branch unit 106 and execution
unit 107 is in accordance with conventional processor design known to those of ordinary skill in
the art.

The data execution path 103 employs SIMD (single instruction multiple data) parallelism, 
in both a fixed execution unit 109 and a configurable deep execution unit 110. As will be
described further below, the configurable deep execution unit 110 provides a depth dimension of
processing, to increase work per instruction, in addition to the width dimension used by
conventional SIMD processors.

If the decoded instruction defines a control instruction it is applied to the appropriate
functional unit on the control execution path of the machine (e.g. branch unit 106, execution unit
107, and load/store unit 108). If the decoded instruction defines an instruction with either a fixed
or configurable data processing operation it is supplied to the data processing execution path.
Within the data instruction part of the instruction packet, designated bits indicate whether the
instruction is a fixed or configurable data processing instruction, and in the case of a configurable
instruction further designated bits define configuration information. In dependence on the
sub-type of decoded data processing instruction, data is supplied to either the fixed or the
configurable execution sub-paths of the data processing path of the machine.

Herein, "configurable" signifies the ability to select an operator configuration from
amongst a plurality of predefined ("pseudo-static") operator configurations. A pseudo-static
configuration of an operator is effective to cause an operator (i) to perform a certain type of
operation, or (ii) to be interconnected with associated elements in a certain manner, or (iii) a
combination of (i) and (ii) above. In practice, a selected pseudo-static configuration may
determine the behavior and interconnectivity of many operator elements at a time. It can also
control switching configurations associated with the data path. In a preferred embodiment, at
least some of the plurality of pseudo-static operator configurations are selectable by an
operation-code portion of a data processing instruction, as will be illustrated further below. Also in
accordance with embodiments herein, a "configurable instruction" allows the performance of
customized operations at the level of multibit values; for example, at the level of four or more bit
multibit values, or at the level of words.

It is pointed out that both control and data processing instructions, performed on their
respective different sides of the machine, can define memory access (load/store) and basic
arithmetic operations. The inputs/operands for control operations may be supplied to/from the
control register file 104, whereas the data/operands for data processing operations are supplied
to/from the data register file 105.

In accordance with an embodiment of the invention, at least one input of each data
processing operation can be a vector. In this respect, the configurable operators and/or switching
circuitry of the configurable data path can be regarded as configurable to perform vector
operations by virtue of the nature of operation performed and/or interconnectivity therebetween.
For example, a 64-bit vector input to a data processing operation may include four 16-bit scalar
operands. Herein, a "vector" is an assembly of scalar operands. Vector arithmetic may be
performed on a plurality of scalar operands, and may include steering, movement, and
permutation of scalar elements. Not all operands of a vector operation need be vectors; for
example, a vector operation may have both a scalar and at least one vector as inputs, and output a
result that is either a scalar or a vector.
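
As a minimal sketch of the vector notion described above, the following Python models a 64-bit vector as four 16-bit scalar lanes, with a lane-wise addition and a mixed scalar/vector addition. The function names (`pack_lanes`, `unpack_lanes`, `simd_add16`, `broadcast_add16`), the lane ordering (lane 0 at the least significant end), and the wrap-around overflow behavior are illustrative assumptions and are not taken from the embodiment.

```python
LANE_BITS = 16                      # width of each scalar operand
LANES = 4                           # four scalars fill one 64-bit vector
LANE_MASK = (1 << LANE_BITS) - 1


def pack_lanes(scalars):
    """Pack four 16-bit scalars into one 64-bit vector.

    Lane 0 is placed at the least significant end (assumed ordering).
    """
    if len(scalars) != LANES:
        raise ValueError("expected four scalar operands")
    vector = 0
    for i, s in enumerate(scalars):
        vector |= (s & LANE_MASK) << (i * LANE_BITS)
    return vector


def unpack_lanes(vector):
    """Split a 64-bit vector back into its four 16-bit scalars."""
    return [(vector >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def simd_add16(a, b):
    """Lane-wise addition; each lane wraps modulo 2**16 (assumed behavior)."""
    sums = [(x + y) & LANE_MASK for x, y in zip(unpack_lanes(a), unpack_lanes(b))]
    return pack_lanes(sums)


def broadcast_add16(vector, scalar):
    """Vector operation with one vector input and one scalar input."""
    return simd_add16(vector, pack_lanes([scalar] * LANES))
```

For instance, `unpack_lanes(simd_add16(pack_lanes([1, 2, 3, 4]), pack_lanes([10, 20, 30, 40])))` yields the four lane sums as separate scalars, which is the sense in which one instruction operates on several data items at once.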

Herein, "control instructions" include instructions dedicated to program flow, and branch
and address generation; but not data processing. "Data processing instructions" include
instructions for logical operations, or arithmetic operations for which at least one input is a
vector. Data processing instructions may operate on multiple data items, for example in
SIMD processing, or in processing wider, short vectors of data elements. The essential functions
of control instructions and data instructions just mentioned do not overlap; however, a
commonality is that both types of code have logic and scalar arithmetic capabilities.

Fig. 2 shows three types of instruction packet for the processor of Fig. 1. Each type of
instruction packet is 64 bits long. Instruction packet 211 is a 3-scalar type, for dense control
code, and includes three 21-bit control instructions (c21). Instruction packets 212 and 213 are
LIW (long instruction word) type, for parallel execution of datapath code. In this example each
instruction packet 212, 213 includes two instructions but different numbers may be included if
desired. Instruction packet 212 includes a 34-bit data instruction (d34) and a 28-bit memory
instruction (m28); and is used for parallel execution of data-side arithmetic (the d34 instruction)
with a data-side load-store operation (the m28 instruction). Memory-class instructions (m28) can
read from, or write to, either the control side or the data side of the processor, using
addresses from the control side. Instruction packet 213 includes a 34-bit data instruction (d34)
and a 21-bit control instruction (c21); and is used for parallel execution of data-side arithmetic
(the d34 instruction) with a control-side operation (the c21 instruction), such as a control-side
arithmetic, branching, or load-store operation.

Instruction decode unit 101 of the embodiment of Fig. 1 uses the initial identification bits,
or some other designated identification bits at predetermined bit locations, of each instruction
packet to determine which type of packet is being decoded. For example, as shown in Fig. 2, an
initial bit "1" signifies that an instruction packet is of a scalar control instruction type, with three
control instructions; while initial bits "0 1" and "0 0" signify instruction packets of type 212 and
213, with a data and memory instruction in packet 212 or a data and control instruction in packet
213. Having decoded the initial bits of each instruction packet, the decode unit 101 of Fig. 1
passes the instructions of each packet appropriately to either the control execution path 102 or the
data execution path 103, according to the type of instruction packet.
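
The identification step just described can be sketched as follows. This is an illustration under stated assumptions, not the decode logic of the embodiment: it assumes the "initial" identification bits sit at the most significant end of the 64-bit word, and the names `PacketType`, `FIELD_BITS`, `classify_packet`, and `used_bits` are hypothetical.

```python
from enum import Enum


class PacketType(Enum):
    CONTROL_3 = 211  # three 21-bit control instructions (c21, c21, c21)
    MD = 212         # one 34-bit data instruction + one 28-bit memory instruction
    CD = 213         # one 34-bit data instruction + one 21-bit control instruction


# Field widths per packet type: identification bits first, then the instructions.
FIELD_BITS = {
    PacketType.CONTROL_3: (1, 21, 21, 21),
    PacketType.MD: (2, 34, 28),
    PacketType.CD: (2, 34, 21),
}


def classify_packet(packet: int) -> PacketType:
    """Classify a 64-bit packet from its leading identification bits.

    Assumes the identification bits are the most significant bits of the word.
    """
    if not 0 <= packet < (1 << 64):
        raise ValueError("packet must be a 64-bit unsigned value")
    if (packet >> 63) & 1:      # initial bit "1"
        return PacketType.CONTROL_3
    if (packet >> 62) & 1:      # initial bits "0 1"
        return PacketType.MD
    return PacketType.CD        # initial bits "0 0"


def used_bits(packet_type: PacketType) -> int:
    """Number of bits accounted for by the fields named in the text."""
    return sum(FIELD_BITS[packet_type])
```

Summing the stated field widths gives 64 bits for packets 211 and 212 and 57 bits for packet 213; the text does not say how the remaining bits of packet 213 are used, so the sketch leaves them uninterpreted.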

In order to execute the instruction packets of Fig. 2, the instruction decode unit 101 of the
processor of the embodiment of Fig. 1 fetches program packets from memory sequentially; and
the program packets are executed sequentially. Within an instruction packet, the instructions of
packet 211 are executed sequentially, with the 21-bit control instruction at the least significant
end of the 64-bit word being executed first, then the next 21-bit control instruction, and then the
21-bit control instruction at the most significant end. Within instruction packets 212 and 213,
the instructions can be executed simultaneously (although this need not necessarily be the case,
in embodiments according to the invention). Thus, in the program order of the processor of the
embodiment of Fig. 1, the program packets are executed sequentially; but instructions within a
packet can be executed either sequentially, for packet type 211, or simultaneously, for packet
types 212 and 213. Below, instruction packets of types 212 and 213 are abbreviated as MD and
CD-packets respectively (containing one memory and one data instruction; and one control
instruction and one data instruction, respectively).
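
The sequential ordering within a packet of type 211 can be illustrated with a short sketch. It assumes that the three 21-bit control instructions occupy bits 0 to 62 of the word, with the single identification bit above them; the helper names `control_instructions_in_order` and `run_control_packet` and the `execute` callback are hypothetical.

```python
C21_BITS = 21
C21_MASK = (1 << C21_BITS) - 1


def control_instructions_in_order(packet: int):
    """Return the three 21-bit control instructions in execution order.

    The instruction at the least significant end comes first, matching the
    order described in the text. Bits 0..62 are assumed to hold the three
    instructions; bit 63 is assumed to be the identification bit.
    """
    return [(packet >> (i * C21_BITS)) & C21_MASK for i in range(3)]


def run_control_packet(packet: int, execute):
    """Issue each control instruction to `execute`, one after another."""
    for instruction in control_instructions_in_order(packet):
        execute(instruction)
```

Passing a list's `append` method as `execute` records the issue order, which makes it easy to check that the least significant instruction is always issued first.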

In using 21-bit control instructions, the embodiment of Fig. 1 overcomes a number of
disadvantages found in processors having instructions of other lengths, and in particular
processors that support a combination of 32-bit standard encoding for data instructions and 16-bit
"dense" encoding for control code. In such dual 16/32-bit processors, there is a redundancy
arising from the use of dual encodings for each instruction, or the use of two separate decoders
with a means of switching between encoding schemes by branch, fetch address, or other means.
This redundancy is removed by using a single 21-bit length for all control instructions, in
accordance with an embodiment of the invention. Furthermore, use of 21-bit control instructions
removes disadvantages arising from insufficient semantic content in a 16-bit "dense" encoding
scheme. Because of insufficient semantic content, processors using a 16-bit scheme typically
require some mix of design compromises, such as: use of two-operand destructive operations,
with corresponding code bloat for copies; use of windowed access to a subset of the register file,
with code bloat for spill/fill or window pointer manipulation; or frequent reversion to the 32-bit
format, because not all operations can be expressed in the very few available opcode bits in a
16-bit format. These disadvantages are alleviated by use of 21-bit control instructions, in an
embodiment of the invention.

A large variety of instructions may be used, in accordance with an embodiment of the 
invention. For example, instruction signatures may be any of the following, where C-format,
M-format, and D-format signify control, memory access, and data format respectively:



Instruction Signature    Arguments                                        Used By

instr                    Instruction has no arguments                     C-format only

instr dst                Instruction has a single destination argument    C-format only

instr src0               Instruction has a single source argument         C- or D-format only

instr dst, src0          Instruction has a single destination argument    D- and M-format
                         and a single source argument                     instructions

instr dst, src0, src1    Instruction has a single destination argument    C-, D-, and M-format
                         and two source arguments                         instructions



Also in accordance with one embodiment of the invention, the C-format instructions all
provide SISD (single instruction single data) operation, while the M-format and D-format
instructions provide either SISD or SIMD operation. For example, control instructions may
provide general arithmetic, comparison, and logical instructions; control flow instructions;
memory load and store instructions; and others. Data instructions may provide general
arithmetic, shift, logical, and comparison instructions; shuffle, sort, byte extend, and permute
instructions; linear feedback shift register instructions; and, via the configurable deep execution
unit 110 (described further below), user-defined instructions. Memory instructions may provide
memory loads and stores; copy selected data registers to control registers; copy broadcast control
registers to data registers; and immediate to register instructions.

In accordance with an embodiment of the invention, the processor of Fig. 1 features a 
first, fixed data execution path and a second configurable data execution path. The first data path 
has a fixed SIMD execution unit split into lanes in a similar fashion to conventional SIMD 
processing designs. The second data path has a configurable deep execution unit 110. "Deep
execution" refers to the ability of a processor to perform multiple consecutive operations on the
data provided by a single issued instruction, before returning a result to the register file. One
example of deep execution is found in the conventional MAC operation (multiply and
accumulate), which performs two operations (a multiplication and an addition) on data from a
single instruction, and therefore has a depth of order two. Deep execution is characterized by the
number of operands input being equal to the number of results output; or, equivalently, the
valency-in equals the valency-out. Thus, for example, a conventional two-operand addition,
which has one result, is not an example of deep execution, because the number of operands is not
equal to the number of results; whereas convolution, Fast Fourier Transforms, Trellis/Viterbi
encoding, correlators, finite impulse response filters, and other signal processing algorithms are
examples of deep execution. Application-specific digital signal processing (DSP) algorithms do
perform deep execution, typically at the bit level and in a memory-mapped fashion. However,
conventional register-mapped general purpose DSPs do not perform deep execution, instead
executing instructions at a depth of order two at most, in the MAC operation. By contrast, the
processor of Fig. 1 provides a register-mapped general purpose processor that is capable of deep
execution of dynamically configurable word-level instructions at orders greater than two. In the
processor of Fig. 1, the nature of the deep execution instruction (the graph of the mathematical
function to be performed) can be adjusted/customised by configuration information in the
instruction itself. In the preferred embodiment, format instructions contain bit positions allocated
to configuration information. To provide this capability, the deep execution unit 110 has
configurable execution resources, which means that operator modes, interconnections, and
constants can be uploaded to suit each application. Deep execution adds a depth dimension to
the parallelism of execution, which is orthogonal to the width dimension offered by the earlier
concepts of SIMD and LIW processing; it therefore represents an additional dimension for
increasing work-per-instruction of a general purpose processor.
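
The notion of depth can be illustrated with a deliberately simplified software model. The sketch below treats a deep instruction as a linear chain of operator stages applied to the operands of one issued instruction, and reports the chain length as the depth. The function `deep_execute` and the two example chains are hypothetical; the model covers only the "multiple consecutive operations per instruction" aspect and does not represent lanes, crossbar routing, or the valency property mentioned above.

```python
import operator


def deep_execute(stages, operands):
    """Apply a chain of two-input operator stages within one 'instruction'.

    The first stage combines operands[0] and operands[1]; each later stage
    combines the running intermediate result with the next operand. Only the
    final result is returned, mirroring the idea that intermediate values are
    not written back to the register file. Returns (result, depth).
    """
    if not stages:
        raise ValueError("at least one stage is required")
    if len(operands) != len(stages) + 1:
        raise ValueError("need exactly one more operand than stages")
    result = operands[0]
    for stage, operand in zip(stages, operands[1:]):
        result = stage(result, operand)
    return result, len(stages)


# Conventional multiply-accumulate: (a * b) + acc, a depth of order two.
MAC = [operator.mul, operator.add]

# A hypothetical deeper chain: ((a * b) + c) >> shift, then masked.
SCALED_MAC = [operator.mul, operator.add, operator.rshift, operator.and_]
```

Calling `deep_execute(MAC, [a, b, acc])` performs the two operations of a MAC in one call, while `deep_execute(SCALED_MAC, [a, b, c, shift, mask])` performs four, which is the sense in which a configurable chain can exceed a depth of two.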

Fig. 3 shows the components of a configurable deep execution unit 310, in accordance 
with an embodiment of the invention. As shown in Fig. 1, the configurable deep execution unit 
110 is part of the data execution path 103, and may therefore be instructed by data-side
instructions from the MD and CD-instruction packets 212 and 213 of Fig. 2. In Fig. 3, an
instruction 314 and operands 315 are supplied to the deep execution unit 310 from instruction
decode unit 101 and data register file 105 of Fig. 1. A multi-bit configuration code in the
instruction 314 is used to access a control map 316, which expands the multi-bit code into a
relatively complex set of configuration signals for configuring operators of the deep execution
unit. The control map 316 may, for example, be embodied as a look-up table, in which different
possible multi-bit codes of the instruction are mapped to different possible operator
configurations of the deep execution unit. Based on the result of consulting the look-up table of
the control map 316, a crossbar interconnect 317 configures a set of operators 318-321 in
whatever arrangement is necessary to execute the operator configuration indicated by the
multi-bit instruction code. The operators may include, for example, a multiply operator 318, an
arithmetic logic unit (ALU) operator 319, a state operator 320, or a cross-lane permuter 321. In
one embodiment, the deep execution unit contains fifteen operators: one multiply operator 318,
eight ALU operators 319, four state operators 320, and two cross-lane permuters 321; although
other numbers of operators are possible. The operands 315 supplied to the deep execution unit
may be, for example, two 16-bit operands; these are supplied to a second crossbar interconnect
322 which may supply the operands to appropriate operators 318-321. The second crossbar
interconnect 322 also receives a feedback 324 of intermediate results from the operators 318-321,
which may then in turn also be supplied to the appropriate operator 318-321 by the second
crossbar interconnect 322. A third crossbar interconnect 323 multiplexes the results from the
operators 318-321, and outputs a final result 325. Various control signals can be used to
configure the operators; for example, control map 316 of the embodiment of Fig. 3 need not
necessarily be embodied as a single look-up table, but may be embodied as a series of two or
more cascaded look-up tables. An entry in the first look-up table could point from a given
multi-bit instruction code to a second look-up table, thereby reducing the amount of storage
required in each look-up table for complex operator configurations. For example, the first look-up table
could be organized into libraries of configuration categories, so that multiple multi-bit instruction
codes are grouped together in the first look-up table with each group pointing to a subsequent
look-up table that provides specific configurations for each multi-bit code of the group.
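
A cascaded control map of the kind described can be sketched as two dictionaries. Everything concrete here is a hypothetical illustration: the 3-bit configuration codes, the library names, the contents of each configuration entry, and the function name `lookup_configuration` are assumptions chosen only to show the two-level indirection.

```python
# First-level table: each multi-bit code points to a library (category)
# and an index within that library's second-level table.
FIRST_LEVEL = {
    0b000: ("mac_family", 0),
    0b001: ("mac_family", 1),
    0b010: ("permute_family", 0),
}

# Second-level tables: one per library, each holding the specific operator
# configurations for the codes grouped under that library.
SECOND_LEVEL = {
    "mac_family": [
        {"operators": ("multiply", "alu"), "alu_mode": "add"},
        {"operators": ("multiply", "alu"), "alu_mode": "sub"},
    ],
    "permute_family": [
        {"operators": ("permuter",), "pattern": "reverse_lanes"},
    ],
}


def lookup_configuration(code: int) -> dict:
    """Expand a multi-bit configuration code into configuration signals."""
    try:
        library, index = FIRST_LEVEL[code]
    except KeyError:
        raise ValueError(f"unknown configuration code: {code:#05b}") from None
    return SECOND_LEVEL[library][index]
```

Grouping related codes under one library keeps the first-level table small, since it stores only a pointer per code, while the detailed configuration signals live in the second-level table for that group.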

In accordance with the embodiment of Fig. 3, the operators are advantageously
pre-configured into various operator classes. In practice, this is achieved by a strategic level of
hardwiring. An advantage of this approach is that it means that fewer predefined configurations
need to be stored, and the control circuitry can be simpler. For example, operators 318 are
pre-configured to be in the class of multiply operators; operators 319 are pre-configured as ALU
operators; operators 320 are pre-configured as state operators; and operators 321 are
pre-configured as cross-lane permuters; and other pre-configured operator classes are possible.
However, even though the classes of operators are pre-configured, there is run-time flexibility for
instructions to be able to arrange at least: (i) connectivity of the operators within each class; (ii)
connectivity with operators from the other classes; (iii) connectivity of any relevant switching
means; for the final arrangement of a specific configuration for implementing a given algorithm.

A skilled reader will appreciate that, while the foregoing has described what is considered 
to be the best mode and where appropriate other modes of performing the invention, the
invention should not be limited to specific apparatus configurations or method steps disclosed in
this description of the preferred embodiment. Those skilled in the art will also recognize that the
invention has a broad range of applications, and that the embodiments admit of a wide range of
different implementations and modifications without departing from the inventive concepts. In
particular, exemplary bit widths mentioned herein are not intended to be limiting, nor is the
arbitrary selection of bit widths referred to as half words, words, long, etc.
